Prerequisites
The EDL Pipeline requires a Unix-like environment (Linux, macOS, or WSL on Windows) with Python 3.7+.
Windows Users: Use WSL (Windows Subsystem for Linux) or Git Bash. The native Windows Command Prompt may have issues with curl commands and path handling.
System Requirements
| Requirement | Specification |
|---|---|
| Python version | 3.7 or higher (tested on 3.8-3.11) |
| Disk space | Minimum 500 MB free (2 GB recommended for OHLCV data) |
| Network | Stable internet connection (pipeline fetches 30+ MB of data) |
| Memory | 4 GB RAM minimum (8 GB recommended) |
Installation Steps
Verify Python Installation
Check that Python 3 is installed:

python3 --version

Expected output: a version string such as Python 3.11.4.

If Python is not installed, download it from python.org or use your system's package manager:

# macOS (Homebrew)
brew install python3
# Ubuntu/Debian
sudo apt update && sudo apt install python3 python3-pip
# Fedora/RHEL
sudo dnf install python3 python3-pip
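You can also verify the interpreter version from Python itself. A minimal standard-library sketch (the `python_ok` helper is illustrative, not part of the pipeline):

```python
import sys

def python_ok(minimum=(3, 7)) -> bool:
    """True if the running interpreter meets the (major, minor) minimum."""
    return sys.version_info[:2] >= minimum

if __name__ == "__main__":
    version = ".".join(map(str, sys.version_info[:3]))
    print(f"Python {version}:", "OK" if python_ok() else "TOO OLD (need 3.7+)")
```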
Install Python Dependencies
The pipeline requires three core Python packages:

pip3 install requests pandas beautifulsoup4
Or use a requirements file (requirements.txt):

requests>=2.28.0
pandas>=1.5.0
beautifulsoup4>=4.11.0
pip3 install -r requirements.txt
| Package | Version | Purpose |
|---|---|---|
| requests | >=2.28.0 | HTTP client for API calls to Dhan and NSE endpoints |
| pandas | >=1.5.0 | OHLCV data processing, CSV parsing (NSE listings) |
| beautifulsoup4 | >=4.11.0 | HTML parsing for surveillance lists (Google Sheets fallback) |
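To confirm that installed packages meet these minimums, here is a hypothetical standard-library check (requires Python 3.8+ for `importlib.metadata`; the helper names are illustrative):

```python
from importlib import metadata

# Minimum versions from the dependency table above.
MIN_VERSIONS = {
    "requests": (2, 28, 0),
    "pandas": (1, 5, 0),
    "beautifulsoup4": (4, 11, 0),
}

def parse_version(version: str) -> tuple:
    """Turn '2.28.1' into (2, 28, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_dependencies(minimums=MIN_VERSIONS):
    """Map each package to (installed_version_or_None, meets_minimum)."""
    results = {}
    for pkg, minimum in minimums.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            results[pkg] = (None, False)
            continue
        results[pkg] = (installed, parse_version(installed) >= minimum)
    return results

if __name__ == "__main__":
    for pkg, (version, ok) in check_dependencies().items():
        print(f"{pkg}: {version or 'MISSING'} ->", "OK" if ok else "NEEDS INSTALL/UPGRADE")
```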
Verify Installation
Confirm all dependencies are installed:

python3 -c "import requests, pandas, bs4; print('All dependencies OK')"

Expected output:

All dependencies OK
Locate the Pipeline Directory
Navigate to the EDL Pipeline source code:

cd ~/workspace/source/"DO NOT DELETE EDL PIPELINE"

(Quote only the folder name, not the `~`, so the shell expands your home directory.)
Verify the master runner script exists: ls -l run_full_pipeline.py
DO NOT DELETE or RENAME this directory. The folder name is intentionally explicit to prevent accidental removal. All pipeline scripts use relative paths and expect to run from this directory.
Verify Directory Structure
The pipeline directory should contain these 18 core scripts:

run_full_pipeline.py # Master runner
fetch_dhan_data.py # Phase 1: Core data
fetch_fundamental_data.py # Phase 1: Fundamentals
fetch_company_filings.py # Phase 2: Filings
fetch_new_announcements.py # Phase 2: Announcements
fetch_advanced_indicators.py # Phase 2: Indicators
fetch_market_news.py # Phase 2: News
fetch_corporate_actions.py # Phase 2: Corporate actions
fetch_surveillance_lists.py # Phase 2: ASM/GSM
fetch_circuit_stocks.py # Phase 2: Circuits
fetch_bulk_block_deals.py # Phase 2: Bulk deals
fetch_incremental_price_bands.py # Phase 2: Price bands
fetch_complete_price_bands.py # Phase 2: Price bands
fetch_all_ohlcv.py # Phase 2.5: OHLCV
bulk_market_analyzer.py # Phase 3: Base JSON
advanced_metrics_processor.py # Phase 4: Metrics
process_earnings_performance.py # Phase 4: Earnings
enrich_fno_data.py # Phase 4: F&O data
add_corporate_events.py # Phase 4: Events (LAST)
Optional/Standalone Scripts
These scripts are NOT part of the main pipeline but can be run manually:

fetch_all_indices.py # 194 market indices
fetch_etf_data.py # 361 ETFs
fetch_fno_data.py # 207 F&O stocks
fetch_fno_lot_sizes.py # F&O lot sizes
fetch_fno_expiry.py # Expiry calendar
single_stock_analyzer.py # Single stock inspector
pipeline_utils.py # Shared utilities
Test Run (Dry Run)
Verify the pipeline can start without errors: python3 -c "import run_full_pipeline; print('Pipeline module loaded successfully')"
Or run a quick test with a single script: python3 fetch_dhan_data.py
This should create two files:
dhan_data_response.json (~5 MB)
master_isin_map.json (~200 KB)
Verify: ls -lh dhan_data_response.json master_isin_map.json
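If you want to sanity-check those two outputs beyond `ls`, here is a small illustrative script (the `verify_output` helper and the byte floors are assumptions based on the approximate sizes above):

```python
import json
import os

def verify_output(path: str, min_bytes: int) -> str:
    """Check that an output file exists, parses as JSON, and meets a
    rough size floor; return a one-line status string."""
    if not os.path.exists(path):
        return f"{path}: MISSING"
    size = os.path.getsize(path)
    try:
        with open(path, encoding="utf-8") as fh:
            json.load(fh)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return f"{path}: EXISTS but is not valid JSON"
    if size < min_bytes:
        return f"{path}: only {size} bytes (smaller than expected)"
    return f"{path}: OK ({size} bytes)"

if __name__ == "__main__":
    # Size floors are rough, based on the ~5 MB / ~200 KB estimates above.
    print(verify_output("dhan_data_response.json", 1_000_000))
    print(verify_output("master_isin_map.json", 50_000))
```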
Directory Structure After First Run
After running the pipeline once, your directory will look like this:
With CLEANUP_INTERMEDIATE = True (Default)

DO NOT DELETE EDL PIPELINE/
├── run_full_pipeline.py
├── fetch_*.py (18 scripts)
├── all_stocks_fundamental_analysis.json.gz # PRIMARY OUTPUT (2-4 MB)
└── ohlcv_data/ # OHLCV cache (if FETCH_OHLCV = True)
    ├── RELIANCE.csv
    ├── TCS.csv
    └── ... (2,775 CSV files)

With CLEANUP_INTERMEDIATE = False

The same layout, except the intermediate JSON files written by each fetch script are retained alongside the primary output.
Recommended: Keep CLEANUP_INTERMEDIATE = True to save disk space. The compressed output contains all data needed for analysis.
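The compressed primary output can be read directly with the standard library. A sketch, assuming the file is a single gzipped JSON document (its internal structure is covered in the Field Reference):

```python
import gzip
import json

def load_primary_output(path="all_stocks_fundamental_analysis.json.gz"):
    """Decompress and parse the primary pipeline output in one step."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)

if __name__ == "__main__":
    data = load_primary_output()
    print(type(data).__name__, "with", len(data), "top-level entries")
```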
Network Configuration
The pipeline makes HTTP requests to multiple endpoints:
Endpoint List

| Endpoint | Purpose | Rate Limit |
|---|---|---|
| ow-scanx-analytics.dhan.co | Full market scan, corporate actions | Thread pool: 1 |
| open-web-scanx.dhan.co | Fundamental data | Thread pool: 1 |
| ow-static-scanx.dhan.co | Filings, announcements, indicators, deals | Thread pool: 15-50 |
| news-live.dhan.co | Real-time news feed | Thread pool: 15 |
| openweb-ticks.dhan.co | OHLCV historical data | Thread pool: 15 |
| nsearchives.nseindia.com | Listing dates, price bands | Direct curl |
| Google Sheets (fallback) | Surveillance lists | Direct requests |
Firewall/Proxy Users: Ensure outbound HTTPS (port 443) is allowed for:
*.dhan.co
nsearchives.nseindia.com
docs.google.com (for surveillance list fallback)
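A quick preflight sketch that checks TCP reachability of each host on port 443 (the `can_reach` helper is illustrative; a successful TCP connect does not guarantee your proxy will pass the actual HTTPS requests):

```python
import socket

# Hosts from the endpoint table above; outbound port 443 must be open.
ENDPOINTS = [
    "ow-scanx-analytics.dhan.co",
    "open-web-scanx.dhan.co",
    "ow-static-scanx.dhan.co",
    "news-live.dhan.co",
    "openweb-ticks.dhan.co",
    "nsearchives.nseindia.com",
    "docs.google.com",
]

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in ENDPOINTS:
        print(f"{host}:", "reachable" if can_reach(host) else "BLOCKED or unreachable")
```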
Validation Checklist
Before running the full pipeline, verify:
Python Dependencies
python3 -c "import requests, pandas, bs4; print('✅ All dependencies OK')"
Network Connectivity
curl -s -o /dev/null -w "%{http_code}" https://ow-scanx-analytics.dhan.co
Expected: 200 or 405 (endpoint exists)
Disk Space
df -h . | tail -1 | awk '{print $4 " available"}'
Ensure at least 500 MB free (2 GB if using OHLCV)
Write Permissions
touch test.json && rm test.json && echo "✅ Write permission OK"
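The disk-space item in the checklist can also be done from Python with `shutil.disk_usage` (thresholds taken from the System Requirements table):

```python
import shutil

def free_mb(path: str = ".") -> int:
    """Free disk space at path, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)

if __name__ == "__main__":
    free = free_mb()
    # 500 MB minimum; 2 GB recommended when OHLCV fetching is enabled.
    print(f"{free} MB free:", "OK" if free >= 500 else "LOW DISK SPACE")
```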
Troubleshooting Installation
ModuleNotFoundError: No module named 'requests'
Cause: Dependencies not installed in the correct Python environment.

Solution:

# Ensure you're using the same python3 binary
which python3
# Install with explicit python3 pip
python3 -m pip install requests pandas beautifulsoup4
# Verify installation
python3 -m pip list | grep -E '(requests|pandas|beautifulsoup4)'
Permission Denied when running scripts
Cause: Scripts lack execute permissions.

Solution:

# Make scripts executable
chmod +x *.py
# Or run with python3 explicitly
python3 run_full_pipeline.py
curl: command not found (NSE CSV download fails)
Cause: curl not installed.

Solution:

# macOS: curl is pre-installed
# Ubuntu/Debian:
sudo apt install curl
# Fedora/RHEL:
sudo dnf install curl
# Verify:
curl --version
Impact if not fixed: Listing dates will be missing, but the pipeline will continue (non-critical).
SSL Certificate Verification Failed
Cause: Corporate proxy or outdated CA certificates.

Solution:

# Update CA certificates
# Ubuntu/Debian:
sudo apt update && sudo apt install ca-certificates
# macOS:
/Applications/Python\ 3.X/Install\ Certificates.command
# Or temporarily disable SSL verification (NOT RECOMMENDED for production):
# Add to fetch scripts:
# response = requests.post(url, json=payload, headers=headers, verify=False)
Timeout errors during OHLCV fetch
Cause: Slow network or rate limiting.

Solution:

# First run: Expect 30-40 min for lifetime OHLCV download
# If timing out repeatedly, increase timeout in fetch_all_ohlcv.py (line ~50):
# timeout=30 → timeout=60
# Or skip OHLCV for faster pipeline:
# Edit run_full_pipeline.py:
FETCH_OHLCV = False
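If you would rather retry than raise the timeout, a generic retry-with-backoff wrapper can help. This is a sketch, not the pipeline's own retry logic; `fetch_with_retry` is a hypothetical helper you could adapt inside fetch_all_ohlcv.py:

```python
import time

def fetch_with_retry(fetch, attempts=3, base_delay=2.0):
    """Call a zero-argument callable, retrying on any exception with
    exponential backoff (2 s, 4 s, ... by default). Re-raises the last
    error if every attempt fails. attempts must be >= 1."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:  # in practice: requests timeouts
            last_error = err
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    raise last_error
```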
Virtual Environment (Optional but Recommended)
To isolate dependencies from system Python:
Create Virtual Environment

python3 -m venv edl-env
Activate Environment
# Linux/macOS:
source edl-env/bin/activate
# Windows (WSL):
source edl-env/bin/activate
Your prompt should now show (edl-env).
Install Dependencies
pip install requests pandas beautifulsoup4
Run Pipeline
python run_full_pipeline.py
Next Steps
Quick Start Guide Run your first pipeline and explore the output
Pipeline Settings Customize pipeline behavior
Pipeline Architecture Understand the pipeline phases
Field Reference Complete guide to all 86 output fields